Codemasters: 28 Years of Cross-Platform AAA Development



Differentiate PC through cutting edge technology 

■ Multi-threading (2007) 

■ Early mover on DX11 (2009)

■ Tessellation, Compute 

■ Forward+ Lighting (2012)







The world is Changing 

Common PC Resolutions

■ 1920x1080: 32%
■ 1366x768: 23%
■ 1600x900: 7%
■ 1280x1024: 7%
■ 1440x900

Source: Steam Hardware Survey, Jan 2014


Most popular PC GPUs




Goals in 2013/14? 


■ Make medium settings match current consoles 

■ Scale up and down from console quality 



[Diagram: target platforms, from high-end desktop PC through ultraportable and legacy PC down to tablet]






Ultra Low? 


Technology R&D


GPU Optimizations 


Power Optimizations 


Rip up the 
rule book 


Pixel Shader Ordering (Making the impossible possible!)


What is it? 

•Pixel Shader Mutex at a screen location 
•Fast as only colliding threads are serialized 
•Guaranteed Execution Order 

Similar to Alpha Blending rules 
•Pixels written in SV_PrimitiveID order
•Pixel Shader Ordering moves this guaranteed 
ordering into the Pixel Shader 


Introduced with 4th Generation
Intel Core Processors



What can we use this for? 

•Anything that wants to read-modify-write a 
per-pixel data structure 


Overlapping pixels can execute in parallel 
without Pixel Shader Ordering 
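
A minimal CPU-side sketch of why ordering matters for read-modify-write on one pixel. It is not shader code: `blend_over` and `shade_pixel` are hypothetical names, and the `ordered` flag only imitates the guarantee that colliding fragments run in primitive order.

```python
def blend_over(dst, src_colour, src_alpha):
    """Standard 'over' blend of one fragment onto the stored pixel value."""
    return src_colour * src_alpha + dst * (1.0 - src_alpha)


def shade_pixel(background, fragments, ordered):
    """Apply every fragment's read-modify-write to a single pixel.

    fragments: list of (primitive_id, colour, alpha) in arrival order.
    ordered=True imitates Pixel Shader Ordering: fragments that collide
    on this pixel are serialized in primitive order.
    """
    if ordered:
        fragments = sorted(fragments, key=lambda frag: frag[0])
    value = background
    for _, colour, alpha in fragments:
        value = blend_over(value, colour, alpha)
    return value


# Two overlapping fragments that reach the pixel in different arrival orders.
arrival_a = [(0, 1.0, 0.5), (1, 0.0, 0.5)]
arrival_b = list(reversed(arrival_a))
```

Without the ordering guarantee the two arrival orders give different pixel values; with it, both give the same result.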




Technology R&D 

How AVSM and OIT made their way into GRID 2 
Sneak peek at programmable blending 



OIT (Order Independent Transparency) 

■ Represent multiple layers of transparency without sorting problems 

■ Denser/softer looking foliage (especially in distance due to mips) 

■ Improved other alpha tested geometry 





Store Visibility Function as a sorted fixed-size array of nodes, in UAV surface
Each red node corresponds to a pair of values for depth and transmittance: (d, t)


To compress visibility we remove the node that generates the smallest area variation

struct TransparencyData
{
    float  depth[ MAX_LAYERS ];
    float4 colour[ MAX_LAYERS ];
};
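
The compression rule can be sketched on the CPU as follows. This is an illustration under assumptions, not the shipped shader: the visibility curve is treated as piecewise linear between nodes, the first and last nodes are never removed, and `area_change` and `insert_and_compress` are hypothetical names.

```python
def area_change(prev, node, nxt):
    """Area under the curve lost or gained if `node` is removed.

    Each argument is a (depth, transmittance) pair. With linear
    interpolation the change equals the area of the triangle
    formed by the three points.
    """
    (d0, t0), (d1, t1), (d2, t2) = prev, node, nxt
    return abs((d1 - d0) * (t2 - t0) - (d2 - d0) * (t1 - t0)) / 2.0


def insert_and_compress(nodes, new_node, max_nodes):
    """Insert a node, then drop interior nodes until max_nodes remain."""
    if max_nodes < 2:
        raise ValueError("need at least the two end nodes")
    nodes = sorted(nodes + [new_node])
    while len(nodes) > max_nodes:
        # Cost of removing each interior node (indices 1 .. len-2).
        costs = [area_change(nodes[i - 1], nodes[i], nodes[i + 1])
                 for i in range(1, len(nodes) - 1)]
        victim = 1 + costs.index(min(costs))
        del nodes[victim]
    return nodes
```

For example, inserting (2.0, 0.45) into a three-node curve that is already nearly straight between depth 1 and depth 3 removes that new node again, because dropping it changes the area least.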


final color = $\sum_{i} c_i \, \alpha_i \, \mathrm{vis}(z_i)$


Final full screen resolve to composite OIT data with main image 
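
A small sketch of the resolve for one pixel, assuming that vis(z_i) is the product of (1 - alpha_j) over all fragments nearer than z_i and that colours are scalars for brevity. `resolve_pixel` is a hypothetical name, and the exact (uncompressed) fragment list is used rather than the fixed-size node array.

```python
def resolve_pixel(fragments, background):
    """Composite transparent fragments over the main image for one pixel.

    fragments: list of (depth, colour, alpha) in any order.
    Returns sum(c_i * alpha_i * vis(z_i)) plus the background scaled by
    the total transmittance through all fragments.
    """
    total = 0.0
    for depth_i, colour_i, alpha_i in fragments:
        visibility = 1.0
        for depth_j, _, alpha_j in fragments:
            if depth_j < depth_i:          # fragment j is in front of i
                visibility *= 1.0 - alpha_j
        total += colour_i * alpha_i * visibility

    remaining = 1.0
    for _, _, alpha in fragments:
        remaining *= 1.0 - alpha
    return total + background * remaining
```

Because visibility is evaluated per fragment from depth, the result does not depend on the order in which fragments were submitted.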


Alpha Coverage in GRID2 


A lot of semi-transparency in foliage (see red mask)

Alpha blending isn't an option...

Original system used Alpha to Coverage (A2C), but it requires
4xMSAA to look good.

Without blending or A2C we get aliased results




Draw Order Problems? 

Must composite full screen quad amongst other transparencies 
Result of not doing full scene OIT 
Organise draw order to solve most problems 
God rays/Haze required a different approach 


Draw Order Solution


Can't put god ray polygons into OIT 

■ They are far too big! 

■ Only need OIT on the areas that overlap 

Solution: Only add haze/god-ray pixels into the OIT if there is already tree OIT data in
the buffer

■ Store mask of pixels in R32 target to flag pixels that contain OIT data 

■ Use the mask to identify pixels where the haze/god rays need to be rendered with OIT 

■ All other pixels are rendered with normal alpha blending, because there is no overlap 
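
The mask test can be sketched like this, with the R32 mask modelled as a set of flagged pixel coordinates. `route_haze_pixels` is a hypothetical helper for illustration only.

```python
def route_haze_pixels(oit_mask, haze_pixels):
    """Split haze/god-ray pixels by whether OIT data already exists there.

    oit_mask: set of (x, y) pixels flagged while rendering tree OIT data.
    haze_pixels: iterable of (x, y) pixels covered by haze/god-ray polygons.
    Returns (pixels to insert into OIT, pixels to alpha blend normally).
    """
    to_oit, to_blend = [], []
    for pixel in haze_pixels:
        if pixel in oit_mask:
            to_oit.append(pixel)      # overlaps foliage: needs correct ordering
        else:
            to_blend.append(pixel)    # no overlap: plain alpha blending is fine
    return to_oit, to_blend
```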



OIT Performance 


2-node OIT 

Tiled memory access
■ Linear (y * width) + x addressing is bad!

Clear mask texture 

Worst case of 3.1 ms (2.5ms + 0.6ms)* 
Typical case of around 2ms* 



* On a 4th Gen Core™ Processor with Intel® Iris™ Pro Graphics, at 1600x900
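
A sketch of the two addressing schemes for the per-pixel OIT buffer. The 8x8 tile size is an assumption chosen for illustration; the slides do not state the layout actually used.

```python
TILE = 8  # hypothetical tile edge length in pixels


def linear_index(x, y, width):
    """Row-major addressing: vertical neighbours are `width` entries apart."""
    return y * width + x


def tiled_index(x, y, width, tile=TILE):
    """Tile-major addressing: all pixels of one tile are stored contiguously."""
    tiles_per_row = (width + tile - 1) // tile
    tile_id = (y // tile) * tiles_per_row + (x // tile)
    within_tile = (y % tile) * tile + (x % tile)
    return tile_id * tile * tile + within_tile
```

At a width of 1600, moving one pixel down jumps 1600 entries with linear addressing but only 8 with this tiled layout, so neighbouring pixels shaded together touch nearby memory.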



AVSM 


■ Adaptive Volumetric Shadow Mapping 

■ Can we use a similar idea to approximate light transmittance through participating media? 

■ Render OIT from the light's point of view 

■ Can be used to render volumetric smoke effects, such as tyre kickup 


AVSM - Problem Background 


Realistic lighting of volumetric media 

■ Hair, smoke, fog, etc.


Compute visibility curve 

■ Transmittance: Fraction of light that passes 
through a material 
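
A minimal sketch of a visibility curve along one light ray, assuming each absorber removes a fixed fraction `alpha` of the light passing through it and that transmittance is a step function of depth. Both function names are hypothetical.

```python
def visibility_curve(absorbers):
    """Build (depth, transmittance) nodes along a light ray.

    absorbers: list of (depth, alpha) in any order.
    Transmittance starts at 1.0 and is multiplied by (1 - alpha)
    at each absorber, nearest to the light first.
    """
    curve = []
    transmittance = 1.0
    for depth, alpha in sorted(absorbers):
        transmittance *= 1.0 - alpha
        curve.append((depth, transmittance))
    return curve


def transmittance_at(curve, depth):
    """Fraction of light reaching `depth` (step-function lookup)."""
    result = 1.0
    for node_depth, node_transmittance in curve:
        if node_depth <= depth:
            result = node_transmittance
        else:
            break
    return result
```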


Original Tyre Smoke


Lighting particles

Original R&D focused on optimizing shadow map
read/write

■ Pixel Shading

■ Tessellated Per Vertex 


■ Per pixel lighting of particles took >= 10ms...


■ Per vertex lighting is too coarse. 


■ Per vertex with screen space tessellation actually 
looked better! 



■ Tessellated per vertex is 2-3x faster



Particle Sorting

Problem: Overlapping emitters weren't sorting correctly

Idea: Use OIT!

■ Requires high node count (16 nodes!)

■ Uses precious GPU performance

CPU sorting?

■ Facing billboards makes CPU sorting possible

■ We had spare performance on the CPU
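
A sketch of the CPU-side option: sort the particles of all emitters together, back to front, by depth along the camera's view direction. The dictionary layout and the function name are assumptions for illustration.

```python
def sort_back_to_front(particles, camera_pos, camera_forward):
    """Sort camera-facing billboards from farthest to nearest.

    particles: list of dicts with a "pos" entry holding (x, y, z).
    camera_forward: unit view direction of the camera.
    Because billboards face the camera, a single depth per particle
    is enough to order them.
    """
    def view_depth(particle):
        return sum((p - c) * f
                   for p, c, f in zip(particle["pos"], camera_pos, camera_forward))

    return sorted(particles, key=view_depth, reverse=True)


# Particles from two overlapping emitters, merged before sorting.
merged = [
    {"emitter": "A", "pos": (0.0, 0.0, 5.0)},
    {"emitter": "B", "pos": (0.0, 0.0, 7.0)},
    {"emitter": "A", "pos": (0.0, 0.0, 9.0)},
    {"emitter": "B", "pos": (0.0, 0.0, 3.0)},
]
ordered = sort_back_to_front(merged, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
```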



AVSM Performance 


Actual worst case of 15ms (9ms + 6ms)*

Typical worst case of 4.5ms (2.5ms + 2ms)* 
Average case much lower (2ms for entire effect) 



* On a 4th Gen Core™ Processor with Intel® Iris™ Pro Graphics, at 1600x900









Programmable Blending 

•HDR lighting values encoded logarithmically into R10G10B10A2 back buffer
•Fixed function alpha blending of encoded values is invalid 
•Result is loss of high dynamic range behind transparencies 
•Solution is to blend in linear space 
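
A numeric sketch of why blending encoded values is invalid. The log2 encoding and the `MAX_HDR` range are assumptions picked for illustration; the actual GRID 2 encoding is not given in the slides.

```python
import math

MAX_HDR = 64.0  # hypothetical maximum linear value the encoding can store


def encode(linear):
    """Map a linear HDR value into [0, 1] logarithmically."""
    return math.log2(1.0 + linear) / math.log2(1.0 + MAX_HDR)


def decode(encoded):
    """Inverse of encode()."""
    return 2.0 ** (encoded * math.log2(1.0 + MAX_HDR)) - 1.0


def blend(dst, src, alpha):
    return src * alpha + dst * (1.0 - alpha)


def fixed_function_blend(dst_encoded, src_linear, alpha):
    """What the blend unit does: mixes the encoded values directly."""
    return blend(dst_encoded, encode(src_linear), alpha)


def programmable_blend(dst_encoded, src_linear, alpha):
    """Decode, blend in linear space, then re-encode."""
    return encode(blend(decode(dst_encoded), src_linear, alpha))
```

With a bright background of 32.0 behind a half-transparent surface of 0.5, linear blending gives 16.25, while blending the encoded values decodes to roughly 6, so the bright range behind the transparency is lost.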





GPU and CPU Optimizations: 

• Different to optimizing a system using discrete graphics 

• Some optimizations were counter intuitive 

• Optimizing for Power and Bandwidth played a big part in getting expected
performance




Performance scaling 

Relative Graphics benchmark performance 


Welcome to the world of power sharing


CPU and GPU share the Thermal Design 
Power (TDP) rating for the system. 

CPU and GPU have maximum allowed
frequencies; you get one or the other,
not both at the same time!

In graphics benchmarks the load sharing
looks like this:

[Chart: Gfx frequency (up to Max) against CPU power (Watts)]




Games look more like this! 


TDP shared more evenly between CPU 
and GPU. 

Audio, AI, higher graphics API overheads
all lead to higher CPU usage.

Higher CPU requirements mean it can be
hard to hit max Gfx frequency.


Lower the TDP, more aggressive trade-off


Max CPU and GPU frequency might not
change much at lower TDP, but you can't
get both at the same time





So what happens when you optimize graphics?


If profiling tells you that you're GPU
bound, then optimizing the GPU will
improve performance, right?


You have a great day and save 20% of
your GPU workload. Do you get 20%
extra FPS?

Extra FPS normally requires more CPU to
drive the workload.




GPU bound? Optimize the CPU!! 


Sounds crazy but increasingly common. 

• GPU and CPU share power budget 

• Frequencies dynamically adjusted at 
run time based on workload 

• Optimizing one gives more power to 
the other 

• Base CPU frequency can be very
misleading
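
A toy model of the shared budget, to make the trade-off concrete. Everything here is an assumption for illustration: the linear relation between GPU power and frequency, the parameter names and the numbers are not measurements of real hardware.

```python
def gpu_frequency_mhz(tdp_watts, cpu_watts, max_gpu_mhz, mhz_per_watt):
    """GPU frequency reachable with whatever power the CPU leaves over.

    Toy assumption: GPU frequency grows linearly with its power budget
    until it reaches the maximum allowed frequency.
    """
    gpu_budget = max(tdp_watts - cpu_watts, 0.0)
    return min(max_gpu_mhz, gpu_budget * mhz_per_watt)


# Hypothetical 15W part: trimming CPU work from 9W to 7W frees power for the GPU.
before = gpu_frequency_mhz(15.0, 9.0, 1000.0, 100.0)
after = gpu_frequency_mhz(15.0, 7.0, 1000.0, 100.0)
```

In this model a GPU-bound frame gets faster after CPU optimization simply because the GPU is allowed to clock higher.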





Power isn't the only thing shared!


Up to 1.7GB of system
memory

Connected to CPU via ring bus
Shared LL$ (last-level cache)

Bandwidth to system memory
shared between CPU
and GPU.

[Diagram: CPU package, Intel 4th Gen Core™ chip]


The Juggling Game continues


Off package bandwidth doesn't 
change much as TDP increases. 

Increasing GPU or CPU 
workload increases bandwidth 
requirements 




Can you feed the system fast enough?

• If a higher TDP doesn't give much more performance, check how busy the GPU is. 

• EU stalls can often be either directly caused by waiting on RAM, or indirectly via 
the sampler. 

• Can be checked in Intel GPA. 




Iris™ Pro

EDRAM

On same package as CPU
128MB

Bandwidth 50GB/sec each way
Acts as 4th level cache

Just works, no API required

[Diagram: CPU package (Intel 4th Gen Core™ chip) with Gfx core, EDRAM and system memory]


eDRAM Speedup Examples *

75% Speed up

* © 2014 IEEE International Solid-State Circuits Conference


SSAO: When R&D doesn't scale down...


■ Disproportionately expensive on medium
settings, 15-20% of a frame.

■ Was CS based, difficult to optimize across
multiple hardware vendors

■ Very memory bandwidth intensive

■ 2 depth samples per occlusion result

■ Smart cross bilateral blur reads from depth
to determine edges

■ Worked at ½ x ½ screen resolution



                      OLD, low quality   OLD, high quality
IvyBridge Ultrabook   8.9ms              10.8ms
Haswell GT3e SDP*     2.71ms             4.2ms
NVidia GTX 470        1.33ms             1.78ms






SSAO: Reinventing The Wheel 

Based on Image-Space Horizon-Based 
Ambient Occlusion 

■ Completely PS-based, still ½ x ½ res

■ Base cost for normal & edge detection + one 
depth sample for one occlusion result 

■ Smart cross bilateral blur uses edges from 
previous pass, doesn't read Depth 




                      NEW, low quality   NEW, high quality
IvyBridge Ultrabook   2.1ms              3.8ms
Haswell GT3e SDP*     0.85ms             1.38ms
NVidia GTX 470        0.41ms             0.84ms


* Pre-production hardware 
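
The old and new SSAO timings can be turned into speed-up factors with a few lines of arithmetic. The figures are the ones quoted on the two SSAO slides; the dictionary and function names are only for illustration.

```python
# (low quality ms, high quality ms) per GPU, as quoted on the slides.
OLD_MS = {
    "IvyBridge Ultrabook": (8.9, 10.8),
    "Haswell GT3e SDP": (2.71, 4.2),
    "NVidia GTX 470": (1.33, 1.78),
}
NEW_MS = {
    "IvyBridge Ultrabook": (2.1, 3.8),
    "Haswell GT3e SDP": (0.85, 1.38),
    "NVidia GTX 470": (0.41, 0.84),
}


def speedups():
    """Old time divided by new time, per GPU and quality level."""
    return {
        gpu: tuple(old / new for old, new in zip(OLD_MS[gpu], NEW_MS[gpu]))
        for gpu in OLD_MS
    }
```

Every configuration comes out more than twice as fast, with the largest gain (a little over 4x) on the IvyBridge Ultrabook at low quality.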


MSAA Performance



• Pixel shader runs once per pixel (yellow dot).
Coverage and occlusion are done at higher rates.

• Storage is required at sub-sample level, which
increases bandwidth and memory requirements

• Costs vary with hardware and workload, but it's
never free.

Source: "Intel Iris Pro 5200 Graphics Review: Core i7-4950HQ Tested",
by Anand Lal Shimpi, June 1, 2013

[Charts: GRID benchmark results for NVIDIA GeForce GT 650M (rMBP15 90W), Intel Iris Pro 5200 (i7-4950HQ 55W and 47W), AMD Radeon HD 7660D (A10-5800K) and AMD Radeon HD 7660G (A10-4600M 35W)]







Post-Process AA as an alternative 






Evaluated two of the most common
ones, SMAA 1x and FXAA 3.11, for GRID2

FXAA 3.11 - good performance but the
developer found it too blurry (detail loss
on text and high frequency textures)

SMAA 1x - a bit too costly to beat MSAA
for forward rendering, still a bit blurry

Started with "Morphological Antialiasing"
[2009 Alexander Reshetov, Intel Labs]



Post process that detects aliasing by
analyzing colour discontinuities (edges),
and applies smart blur to reduce aliasing
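
The first step of any morphological AA, finding colour discontinuities, can be sketched as below. This is only a threshold test on neighbouring luma values, not the MLAA or CMAA algorithm itself; `find_edges` and its threshold are hypothetical.

```python
def find_edges(luma, threshold):
    """Return pixel pairs whose luma difference exceeds `threshold`.

    luma: rectangular list of rows of floats.
    Each edge is ((x, y), (nx, ny)) where the second pixel is the
    right or bottom neighbour of the first.
    """
    edges = set()
    height = len(luma)
    for y in range(height):
        width = len(luma[y])
        for x in range(width):
            if x + 1 < width and abs(luma[y][x] - luma[y][x + 1]) > threshold:
                edges.add(((x, y), (x + 1, y)))
            if y + 1 < height and abs(luma[y][x] - luma[y + 1][x]) > threshold:
                edges.add(((x, y), (x, y + 1)))
    return edges


# A small staircase: the kind of shape a later pass would smooth.
staircase = [
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
]
```

A later pass would classify the shapes these edges form and blur only along them; raising the threshold makes the detection more conservative.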



Enter Conservative Morphological AA (CMAA) 

• Based on MLAA, but solving only symmetrical Z 
shapes instead of U, Z and L-shapes 

• Better preservation of average image colour and 
temporal stability. 


• Conservative approach to determining and 
pruning edges, "if unsure, don't blur" 

• Overall less damaging and higher AA quality
compared to FXAA 3.11

• Tailored for Intel Haswell: as fast as FXAA 3.11,
twice as fast as SMAA 1x.

[Image: text rendering comparison, "the quick brown fox jumps over the lazy ..."]


GRID2 performance comparison vs MSAA
(milliseconds per frame)

[Chart: CMAA, 2xMSAA and 4xMSAA cost on the following configurations]
■ HSW 15W i5 2+2, Medium, 1366x768
■ HSW 28W i7 2+3, Medium, 1366x768
■ HSW 28W i7 2+3, Medium, 1600x900
■ HSW 47W GT3e 4+3, High, 1600x900
■ HSW 47W GT3e 4+3, High, 1920x1080


Road Pixel Shader


•Blocked on pixel shaders

• EU pixel shader stall = 48.2%

• Aniso filtering stalls samplers

• 3.7 pixel cache lines accessed per
sample.

•New menu option added to scale artist-set
anisotropic levels.

• <20% on medium 

• <5% on low. 


Fullscreen Shadow Pass


•Blocked on pixel shaders

• EU pixel shader stall = 42.3%

• Originally read in 4 shadow textures 

• One shader for all quality settings 

• Clear textures read on lower settings 


•Stencil mask added to remove selected areas, 
such as the sky 

•Different shader used on medium and below, 
removed reads from particle shadow textures 






Ripping Up The Rule Book 

And you thought 15 watts was a challenge...



A different approach to optimization 


•Can remove high end PC features 

• Tablet GPU performance was initially ~53ms per frame 

• But more scope for making aggressive changes 

• Scalability (Lots more graphics menu options!) 

• More selective use of Specular and Normal Maps, etc 

•Cheaper Shaders 

• Idea: use Environment Map shaders for our main scene? 

• This render pass is essentially a low quality version of our main colour pass 

• Too low quality in some cases 

• Saved 20ms GPU time! 



Fixing the Visuals 


•Using environment map shaders for main scene has consequences 

• Screen space maps all disappear (shadows, SSAO) 

• Seeing shader pass is a useful debug tool 

• Hard to see bugs in these shaders when looking directly into car reflection 

•Fallback Render Pass 

• Engine already supports a fallback render pass 

• If Primary pass doesn't exist for shader, use Secondary 

• Implemented new pass to fix specific problems 

• Undercar shadow was missing 

• Headlights no longer illuminated the track at night 

• Etc 



A bit more optimization 


•Other ideas 

• Texture LOD bias 

• Visual quality declines very quickly, and tests currently show negligible gains 

• Lower geometry LODs 

• Nearer draw distances 

• Billboard LODs for trees/crowd 

• Reduces vertex cost, and lighting costs 

• Simple Post Process 

• Only need tone mapping (which, for us, requires bloom) 

• Motion blur, lens flare, etc. all gone



Low resolution particle rendering 


•An effective optimisation for console/PC 

• Reduce fill-rate by rendering particles at lower resolution, and combining them with the 
main framebuffer 

• ¼ width and height

•Fixed-cost overheads a lot higher on Tablet 

• Creating downsampled depth buffer 

• Upsample downsampled colour buffer 

•More efficient to render particles at full resolution, and sacrifice particle counts 

• High particle counts only occur during collisions, or when going off track 
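
A toy cost model of this trade-off: low-resolution rendering saves fill cost per particle but adds a fixed overhead for the depth downsample and colour upsample. All names and numbers are assumptions for illustration, not measured tablet timings.

```python
def full_res_cost_ms(particles, fill_ms_per_particle):
    """Fill cost when particles are drawn straight into the main framebuffer."""
    return particles * fill_ms_per_particle


def low_res_cost_ms(particles, fill_ms_per_particle, area_ratio, fixed_overhead_ms):
    """Fill cost at reduced resolution plus the fixed downsample/upsample passes.

    area_ratio: full-resolution pixel count divided by low-resolution pixel count.
    """
    return fixed_overhead_ms + particles * fill_ms_per_particle / area_ratio


def break_even_particles(fill_ms_per_particle, area_ratio, fixed_overhead_ms):
    """Particle count above which the low-resolution path becomes cheaper."""
    saving_per_particle = fill_ms_per_particle * (1.0 - 1.0 / area_ratio)
    return fixed_overhead_ms / saving_per_particle
```

With a hypothetical 3ms fixed overhead, 1/32 ms of fill per particle and a 4x area reduction, the low-resolution path only wins above 128 particles, which is why a high fixed cost favours full resolution with fewer particles.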



Too optimised? 


•Game now running at over 40fps! 

•Lowest Preset used to be a "compatibility mode" 

• Particles, Crowd, Drivers, and Shadows were all disabled 

• Choose which systems we could turn back on to get us to 30fps 


Upscaling is not just for consoles


•HUD looked low quality at non-native Tablet resolutions 

•High resolution backbuffer, low resolution colour render target 

• Already support naive supersampling in engine 

• Easy to modify this feature to support smaller target, instead of larger 



Performance Before and After


Pre Optimization

[Pie chart of frame time: Post Processing, Track, Scenery, Car, Particles, HUD]


Post Optimization

[Pie chart of frame time: Post Processing, Track, Scenery, Car, Particles, HUD, Upscaling]


That's a wrap:

1. New extensions allow for visual differentiation and are power efficient

2. Existing algorithms can be significantly optimized

3. Normal GPU optimization rules are subtly different; bandwidth and power
mean things are not always intuitive

4. CMAA is a good low-cost post-process anti-aliasing solution on all hardware, for
cases where blurriness / image degradation is a concern. Its AA results are not as
good as SMAA's though (especially vs. the more expensive variations)

5. Sample & code available online at Intel web page

6. Are you making the best visual trade-offs on power constrained hardware?


Can your engine scale from Tablet to high end PC???? 


Questions?







Thanks for attending! 
Want to go further? 


Grid 2 http://tinyurl.com/buygrid2 


AVSM research
OIT research
OIT research


AVSM sample
OIT sample
CMAA sample



Ready for More? Look Inside™ 

Keep in touch with us at GDC and beyond: 

• Game Developer Conference 

Visit our Intel® booth #1016 in Moscone South

• Intel University Games Showcase 
Marriott Marquis Salon 7, Thursday 5:30pm 
RSVP at bit.ly/intelgame 

• Intel Developer Forum, San Francisco 
September 9-11, 2014 

intel.com/idf14

• Intel Software Adrenaline 
@inteladrenaline 

• Intel Developer Zone 
software.intel.com 
@intelsoftware 